Note: In this chapter we learn about sampling distributions.

  1. First, we will look at a simple activity-based example.
  2. We will have sections named “DETOUR #”, in which we will learn about some brand name distributions.

Let’s begin…

1 Sampling Distribution of the sample proportion

1.1 What proportion of this bowl’s balls are red?

Take a look at the bowl in the following Figure. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand as there does not seem to be any particular pattern to the spatial distribution of red and white balls.

Let’s now ask ourselves, what proportion of this bowl’s balls are red?

One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However this would be a long and tedious process.

Instead, suppose we insert a shovel into the bowl and remove a sample of balls. Observe that ____ of the balls in the shovel are red and there are a total of ____ balls, and thus ___ % of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count, our guess of ___% took much less time and energy to obtain.

However, say we started this activity over from the beginning. In other words, we return the 50 balls to the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe?

What if we repeated this exercise several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not.

Let’s try to do this on the computer…

To this end, we use a data frame bowl in the moderndive package whose rows correspond exactly with the contents of the actual bowl.

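library(dplyr)       # %>%, group_by(), summarize(), mutate()
library(ggplot2)     # ggplot(), geom_histogram()
library(moderndive)  # bowl data and rep_sample_n()
library(knitr)       # kable()
library(cowplot)     # plot_grid()

head(bowl)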
# A tibble: 6 x 2
  ball_ID color
    <int> <chr>
1       1 white
2       2 white
3       3 white
4       4 red  
5       5 white
6       6 white
# View(bowl) # Use this in the console

Observe in the output that bowl has ___ rows, telling us that the bowl contains ___ equally-sized balls. The first variable ball_ID is used merely as an “identification variable”; none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white.

Now that we have a virtual analogue of our bowl, we need a virtual analogue of the shovel seen in Figure 2; we’ll use this virtual shovel to generate our virtual random samples of 50 balls. We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size n. Run the following and explore.

virtual_shovel <- bowl %>% 
  rep_sample_n(size = 50)

virtual_shovel
# A tibble: 50 x 3
# Groups:   replicate [1]
   replicate ball_ID color
       <int>   <int> <chr>
 1         1    1804 red  
 2         1     308 red  
 3         1    1663 white
 4         1     345 white
 5         1    2125 white
 6         1    2025 red  
 7         1    2204 white
 8         1     813 white
 9         1     352 red  
10         1    1962 white
# … with 40 more rows
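
To make the shovel less of a black box, here is a minimal base-R sketch of a single scoop. It assumes a small hypothetical bowl (toy_bowl, with made-up counts of 8 red and 12 white balls) and a 5-slot shovel. It only illustrates the idea and is not how rep_sample_n() is implemented.

```r
# Hypothetical toy bowl: 8 red and 12 white balls (made-up counts)
toy_bowl <- data.frame(
  ball_ID = 1:20,
  color   = rep(c("red", "white"), times = c(8, 12))
)

set.seed(1)  # make the random draw reproducible

# One use of a 5-slot shovel: pick 5 distinct rows at random
picked_rows <- sample(nrow(toy_bowl), size = 5)
toy_shovel  <- toy_bowl[picked_rows, ]

toy_shovel
```

Each ball_ID appears at most once because sample() draws without replacement by default, just as a physical shovel cannot scoop the same ball twice.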

Next, we can find out how many red balls there are in our virtual_shovel:

virtual_shovel %>% 
  summarize(num_red = sum(color=="red"))  
# A tibble: 1 x 2
  replicate num_red
      <int>   <int>
1         1      20

How about the proportion that are red? We can use the mutate() function to create a new variable, in this case prop_red.

virtual_shovel %>% 
  summarize(num_red = sum(color == "red")) %>% 
  mutate(prop_red = num_red / 50)
# A tibble: 1 x 3
  replicate num_red prop_red
      <int>   <int>    <dbl>
1         1      20      0.4
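
The count-then-divide step above has a handy shortcut. The sketch below uses a made-up vector of 10 colors (not data from the bowl) to show that, because R treats TRUE as 1 and FALSE as 0, the mean of a logical vector is the proportion directly.

```r
# Hypothetical contents of a 10-slot shovel (made-up values)
colors <- c("red", "white", "white", "red", "white",
            "white", "red", "white", "white", "red")

# Two-step version: count the red balls, then divide by the sample size
num_red  <- sum(colors == "red")
prop_red <- num_red / length(colors)

# One-step version: the mean of TRUE/FALSE values is the proportion of TRUEs
prop_red_direct <- mean(colors == "red")

c(two_step = prop_red, one_step = prop_red_direct)
```

Dividing by length(colors) instead of a hard-coded number also avoids having to update the denominator when the shovel size changes.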

1.2 Using the virtual shovel many times

To use the virtual shovel many times, we set the reps argument of rep_sample_n(); here we take 30 samples of size 50.

virtual_samples <- bowl %>% 
  rep_sample_n(size = 50, reps = 30)

kable(virtual_samples)
replicate ball_ID color
1 1614 red
1 2146 white
1 841 white
1 130 red
1 1456 red
1 1100 white
1 742 white
1 1423 white
1 203 red
1 1631 red
1 271 white
1 860 white
1 1321 white
1 2314 white
1 284 white
1 497 white
1 373 white
1 219 white
1 200 red
1 736 red
1 1759 white
1 1993 white
1 2139 white
1 929 red
1 303 red
1 2140 white
1 460 white
1 52 red
1 2074 red
1 1912 white
1 2344 red
1 470 red
1 866 white
1 2025 red
1 1589 red
1 365 white
1 646 white
1 627 red
1 1616 red
1 521 white
1 258 red
1 601 red
1 1900 white
1 2113 white
1 2061 white
1 2104 white
1 640 white
1 1024 red
1 2370 white
1 1876 white
2 1970 white
2 2218 white
2 1782 white
2 795 white
2 771 white
2 182 white
2 1408 white
2 386 white
2 2225 white
2 179 red
2 837 red
2 1244 red
2 842 red
2 1299 red
2 22 white
2 1864 white
2 2198 white
2 1084 white
2 1827 red
2 2071 red
2 1303 red
2 427 red
2 990 white
2 653 white
2 1349 white
2 1137 white
2 1527 white
2 1471 red
2 157 red
2 1167 red
2 1461 red
2 2117 white
2 1369 red
2 66 red
2 2268 red
2 1590 white
2 2335 white
2 2021 white
2 1402 white
2 2341 white
2 1373 red
2 2072 white
2 2115 white
2 924 red
2 2248 red
2 2127 white
2 1340 white
2 1319 white
2 1374 red
2 1106 white
3 203 red
3 1690 white
3 232 red
3 1485 white
3 1892 white
3 555 white
3 396 red
3 817 white
3 1966 red
3 2315 white
3 1064 white
3 830 red
3 785 white
3 2314 white
3 462 red
3 826 red
3 1144 red
3 606 red
3 1782 white
3 1468 white
3 1867 white
3 972 red
3 717 white
3 1524 white
3 2053 red
3 312 white
3 2274 red
3 641 white
3 853 white
3 841 white
3 98 red
3 1942 white
3 1003 white
3 805 white
3 1797 white
3 1905 white
3 854 red
3 374 white
3 2225 white
3 627 red
3 1149 white
3 1330 white
3 1400 white
3 833 red
3 947 white
3 73 white
3 1105 white
3 1339 white
3 1178 white
3 1992 red
4 563 white
4 1499 red
4 692 white
4 203 red
4 1035 white
4 588 red
4 1453 red
4 329 white
4 2043 red
4 298 red
4 110 white
4 284 white
4 1031 red
4 1929 red
4 2044 white
4 302 white
4 344 white
4 81 white
4 1080 red
4 1148 red
4 564 white
4 1321 white
4 2017 white
4 2258 white
4 840 white
4 143 red
4 1974 red
4 2265 red
4 1602 red
4 1176 red
4 371 white
4 663 red
4 565 red
4 1980 white
4 342 white
4 784 white
4 414 red
4 1616 red
4 1553 white
4 245 white
4 2020 red
4 2254 white
4 490 white
4 1753 red
4 1178 white
4 1829 white
4 1559 white
4 1713 white
4 553 white
4 2334 red
5 1655 red
5 426 white
5 228 white
5 2067 red
5 605 white
5 1791 white
5 529 white
5 1494 red
5 905 white
5 1295 white
5 360 red
5 1802 white
5 1090 white
5 2261 red
5 1403 white
5 869 white
5 2189 white
5 1205 red
5 2203 white
5 1151 white
5 902 red
5 87 red
5 633 red
5 859 white
5 1077 white
5 819 red
5 1705 white
5 2068 red
5 1271 red
5 473 white
5 1220 white
5 2188 red
5 2093 white
5 92 white
5 2029 white
5 1658 red
5 1477 red
5 734 white
5 1379 white
5 736 red
5 1374 red
5 2367 white
5 2341 white
5 2202 white
5 828 white
5 1913 white
5 1952 white
5 1660 white
5 995 white
5 2088 white
6 2209 red
6 474 red
6 64 red
6 1324 red
6 1728 red
6 1606 white
6 1831 white
6 1387 white
6 252 white
6 1856 white
6 1768 white
6 620 red
6 1228 red
6 24 white
6 1885 red
6 1021 red
6 697 white
6 1240 white
6 2125 white
6 1053 red
6 1209 white
6 2070 white
6 2186 red
6 2052 white
6 650 white
6 827 red
6 1313 red
6 731 white
6 2371 white
6 1257 white
6 1452 red
6 1451 red
6 869 white
6 363 white
6 2193 red
6 1017 red
6 668 red
6 568 red
6 240 red
6 1572 white
6 479 white
6 1195 red
6 1485 white
6 267 white
6 329 white
6 1527 white
6 622 red
6 291 white
6 414 red
6 2372 white
7 1297 white
7 1704 red
7 2091 red
7 1602 red
7 964 white
7 209 red
7 158 red
7 2031 white
7 587 white
7 2198 white
7 1932 white
7 1940 red
7 1415 red
7 2203 white
7 2130 white
7 1527 white
7 791 white
7 1939 white
7 2175 white
7 75 red
7 1670 white
7 1529 red
7 1813 white
7 328 white
7 1787 white
7 1810 red
7 416 white
7 1900 white
7 1692 white
7 530 red
7 367 red
7 129 red
7 741 white
7 1245 white
7 177 red
7 2225 white
7 920 red
7 1906 red
7 68 white
7 483 white
7 2147 white
7 240 red
7 1509 white
7 2053 red
7 2088 white
7 1554 white
7 270 red
7 128 white
7 86 white
7 382 white
8 2348 red
8 937 red
8 142 white
8 2039 white
8 1239 white
8 1291 white
8 833 red
8 1015 white
8 608 white
8 1458 white
8 2292 white
8 155 red
8 1107 white
8 1264 white
8 1153 white
8 1542 white
8 746 white
8 447 red
8 1248 white
8 189 white
8 871 red
8 464 white
8 1442 white
8 423 white
8 57 white
8 152 red
8 276 red
8 787 red
8 11 white
8 17 red
8 2168 white
8 675 white
8 1978 white
8 396 red
8 2340 red
8 1706 white
8 1359 red
8 1067 white
8 2029 white
8 676 red
8 1111 white
8 1889 red
8 1510 white
8 24 white
8 228 white
8 2161 red
8 2202 white
8 1412 white
8 1672 white
8 1225 white
9 139 white
9 883 red
9 1732 white
9 2198 white
9 2248 red
9 2391 white
9 1182 white
9 1501 white
9 630 white
9 652 white
9 1814 white
9 63 white
9 38 white
9 2397 red
9 1750 white
9 1319 white
9 1939 white
9 2269 red
9 1394 red
9 1016 red
9 543 red
9 1076 white
9 90 white
9 1738 red
9 1305 white
9 423 white
9 557 white
9 2311 white
9 1927 red
9 183 red
9 446 white
9 1057 white
9 779 red
9 103 white
9 1960 white
9 43 white
9 1124 white
9 1021 red
9 1168 red
9 1724 white
9 784 white
9 595 white
9 979 white
9 1854 white
9 268 white
9 2027 red
9 855 red
9 1502 white
9 404 white
9 783 white
10 2140 white
10 2095 white
10 2220 red
10 776 white
10 1615 white
10 1606 white
10 244 white
10 1910 white
10 1182 white
10 1975 white
10 1481 white
10 1973 white
10 372 white
10 336 white
10 279 white
10 1741 red
10 1639 white
10 658 red
10 1163 white
10 938 red
10 846 red
10 2201 red
10 841 white
10 571 white
10 891 red
10 1836 white
10 640 white
10 804 white
10 1596 red
10 1324 red
10 1907 white
10 987 white
10 2361 white
10 727 white
10 1500 white
10 1835 red
10 762 white
10 1934 white
10 1321 white
10 367 red
10 242 red
10 15 red
10 597 red
10 1436 white
10 1088 red
10 2289 white
10 1205 red
10 2047 white
10 445 red
10 1859 red
11 706 white
11 1468 white
11 338 white
11 1816 white
11 938 red
11 1458 white
11 456 white
11 410 red
11 135 red
11 966 red
11 982 white
11 1193 white
11 574 white
11 573 white
11 638 white
11 2118 white
11 1076 white
11 2093 white
11 2043 red
11 690 red
11 1470 white
11 921 red
11 2069 white
11 1242 red
11 1465 red
11 2108 white
11 2293 white
11 312 white
11 1524 white
11 1081 white
11 1557 red
11 1647 white
11 1357 white
11 860 white
11 2214 white
11 642 white
11 2345 white
11 1172 red
11 345 white
11 1514 white
11 2327 red
11 210 red
11 1098 red
11 606 red
11 776 white
11 808 red
11 1607 white
11 2138 white
11 1651 white
11 2045 white
12 70 white
12 1499 red
12 580 white
12 1895 white
12 71 white
12 139 white
12 100 red
12 1198 red
12 1348 red
12 520 white
12 1463 red
12 40 red
12 1931 white
12 2061 white
12 907 red
12 742 white
12 1741 red
12 534 red
12 1867 white
12 468 red
12 1980 white
12 1441 white
12 1246 red
12 334 white
12 645 white
12 1550 white
12 1698 white
12 428 white
12 1306 white
12 148 white
12 753 white
12 1352 white
12 500 white
12 1144 red
12 806 red
12 1189 white
12 1211 white
12 2107 white
12 490 white
12 2116 red
12 356 red
12 963 white
12 1878 white
12 902 red
12 932 white
12 245 white
12 1084 white
12 1254 white
12 1475 white
12 995 white
13 2271 red
13 47 white
13 2019 red
13 2139 white
13 2105 red
13 914 white
13 1184 white
13 2330 red
13 2154 white
13 1541 red
13 689 red
13 1059 white
13 2095 white
13 1728 red
13 1626 white
13 879 white
13 2245 red
13 245 white
13 1844 red
13 1311 white
13 2270 white
13 1439 red
13 782 white
13 609 white
13 153 red
13 461 white
13 1684 white
13 2302 red
13 663 red
13 226 white
13 145 white
13 283 red
13 1044 white
13 1124 white
13 599 white
13 294 white
13 340 white
13 1215 white
13 1556 red
13 1770 red
13 1555 red
13 1673 white
13 908 white
13 936 white
13 1218 white
13 1468 white
13 1953 red
13 980 white
13 2009 red
13 31 red
14 1919 red
14 320 white
14 1366 white
14 1051 red
14 713 red
14 612 red
14 1483 white
14 1489 white
14 1286 white
14 2321 red
14 2381 white
14 2353 white
14 933 white
14 1569 white
14 2111 white
14 1436 white
14 1933 red
14 1529 red
14 768 white
14 1396 white
14 240 red
14 308 red
14 1264 white
14 1037 red
14 2345 white
14 1165 white
14 1287 white
14 319 white
14 2001 red
14 301 red
14 118 red
14 2283 red
14 814 red
14 2134 white
14 1193 white
14 458 white
14 1623 white
14 1350 white
14 1337 white
14 792 white
14 1566 white
14 1760 red
14 519 white
14 97 red
14 1824 red
14 1192 white
14 1089 white
14 243 red
14 1012 white
14 348 red
15 286 white
15 1987 white
15 1747 white
15 161 white
15 1586 white
15 18 red
15 1897 white
15 1271 red
15 1309 red
15 379 white
15 1602 red
15 1279 white
15 882 white
15 97 red
15 99 white
15 1694 white
15 264 white
15 358 white
15 1718 white
15 1223 red
15 1619 white
15 861 red
15 913 white
15 734 white
15 1498 white
15 1834 white
15 1622 red
15 714 white
15 461 white
15 1241 white
15 1620 red
15 2196 red
15 30 red
15 727 white
15 1461 red
15 331 red
15 686 red
15 1974 red
15 945 white
15 2205 red
15 742 white
15 1163 white
15 2290 white
15 505 red
15 1198 red
15 446 white
15 702 white
15 1832 white
15 2214 white
15 2260 white
16 1813 white
16 911 white
16 916 white
16 2327 red
16 1212 red
16 232 red
16 977 white
16 1405 red
16 1757 white
16 1768 white
16 566 white
16 580 white
16 1789 red
16 439 white
16 595 white
16 1431 white
16 2354 white
16 1552 white
16 1403 white
16 940 red
16 2268 red
16 2182 white
16 25 red
16 1498 white
16 1179 white
16 1313 red
16 2329 white
16 1590 white
16 764 white
16 386 white
16 1398 white
16 1712 red
16 2223 white
16 1105 white
16 2069 white
16 1317 white
16 553 white
16 21 red
16 860 white
16 1030 white
16 1490 white
16 1058 red
16 2027 red
16 1483 white
16 2238 red
16 2176 white
16 854 red
16 1630 white
16 1769 white
16 87 red
17 2071 red
17 2078 red
17 517 white
17 1940 red
17 1431 white
17 76 white
17 479 white
17 186 white
17 1386 red
17 1215 white
17 514 white
17 147 white
17 280 white
17 1679 white
17 945 white
17 2064 white
17 170 white
17 1728 red
17 1445 red
17 1328 red
17 960 white
17 2212 white
17 995 white
17 2266 red
17 1298 white
17 2167 red
17 600 white
17 1444 red
17 1248 white
17 37 white
17 240 red
17 1996 white
17 762 white
17 531 white
17 2396 white
17 863 white
17 1182 white
17 472 white
17 1461 red
17 1179 white
17 2144 white
17 192 white
17 642 white
17 223 red
17 1843 red
17 750 white
17 1542 white
17 2039 white
17 574 white
17 1933 red
18 421 white
18 1855 red
18 512 white
18 1889 red
18 136 red
18 1911 white
18 1293 white
18 275 white
18 2284 red
18 311 white
18 1467 white
18 726 red
18 881 red
18 208 red
18 802 red
18 1060 white
18 281 white
18 890 white
18 2132 white
18 618 white
18 123 white
18 1820 red
18 474 red
18 791 white
18 1995 red
18 1101 white
18 1418 red
18 1555 red
18 1845 white
18 1159 white
18 1816 white
18 396 red
18 1916 white
18 1753 red
18 1265 red
18 2039 white
18 550 red
18 484 white
18 232 red
18 376 red
18 598 white
18 295 red
18 338 white
18 482 red
18 1812 white
18 1465 red
18 1309 red
18 1139 red
18 555 white
18 1926 red
19 172 white
19 1477 red
19 2013 white
19 1364 white
19 533 white
19 1376 red
19 884 white
19 1083 red
19 2147 white
19 1816 white
19 395 red
19 1046 red
19 1897 white
19 2094 red
19 945 white
19 1576 white
19 1 white
19 1846 red
19 225 white
19 2357 white
19 378 white
19 655 white
19 1833 white
19 1641 red
19 1612 white
19 1239 white
19 1918 white
19 230 white
19 1541 red
19 1346 red
19 845 white
19 1924 red
19 1307 red
19 1143 white
19 369 white
19 2081 white
19 1105 white
19 1130 white
19 1378 red
19 456 white
19 2021 white
19 1821 white
19 1736 white
19 1388 white
19 975 white
19 263 white
19 327 white
19 2301 red
19 1581 white
19 1511 white
20 1450 red
20 1915 white
20 1828 white
20 97 red
20 264 white
20 1158 white
20 2106 white
20 573 white
20 858 white
20 2149 white
20 76 white
20 1282 white
20 1681 red
20 1335 white
20 2276 white
20 332 red
20 1133 white
20 2162 red
20 631 red
20 189 white
20 1722 white
20 2012 red
20 1496 white
20 1781 white
20 350 white
20 50 red
20 220 red
20 989 white
20 1417 white
20 202 red
20 550 red
20 188 white
20 302 white
20 1588 red
20 1847 white
20 1452 red
20 174 red
20 1564 white
20 773 white
20 1221 white
20 242 red
20 2192 red
20 1266 white
20 2375 red
20 1682 white
20 1207 white
20 5 white
20 2261 red
20 719 white
20 1061 red
21 2058 red
21 978 white
21 1323 red
21 392 red
21 1360 white
21 660 red
21 2392 red
21 966 red
21 2124 white
21 814 red
21 193 white
21 961 white
21 1049 red
21 1224 white
21 1825 white
21 2117 white
21 647 white
21 1632 red
21 541 white
21 1895 white
21 1205 red
21 873 white
21 51 white
21 275 white
21 525 red
21 1833 white
21 1863 white
21 2213 white
21 1544 white
21 1227 red
21 1490 white
21 1497 white
21 95 white
21 1321 white
21 378 white
21 1578 red
21 434 white
21 1874 white
21 2188 red
21 179 red
21 612 red
21 1216 white
21 203 red
21 967 white
21 44 white
21 1293 white
21 1438 white
21 2043 red
21 2331 white
21 20 white
22 1927 red
22 2381 white
22 1159 white
22 726 red
22 709 white
22 1694 white
22 645 white
22 532 red
22 1627 white
22 575 white
22 1478 white
22 476 red
22 862 red
22 305 white
22 516 red
22 716 white
22 1917 red
22 1856 white
22 2327 red
22 1243 white
22 2192 red
22 168 red
22 469 red
22 341 white
22 1909 red
22 786 red
22 827 red
22 1725 white
22 45 white
22 335 white
22 1849 red
22 602 red
22 1156 white
22 2158 red
22 910 white
22 1561 white
22 1663 white
22 1346 red
22 2382 red
22 538 white
22 554 red
22 901 white
22 2233 white
22 2111 white
22 500 white
22 1542 white
22 1473 red
22 188 white
22 816 red
22 340 white
23 1951 white
23 938 red
23 1068 red
23 220 red
23 2125 white
23 2297 white
23 2156 white
23 785 white
23 2165 white
23 809 white
23 954 white
23 2231 white
23 1362 red
23 1103 red
23 1267 red
23 2246 red
23 425 red
23 1626 white
23 1746 white
23 1360 white
23 534 red
23 344 white
23 440 white
23 1583 red
23 2400 white
23 485 red
23 2055 red
23 902 red
23 1166 white
23 202 red
23 339 red
23 242 red
23 2171 white
23 263 white
23 1791 white
23 165 white
23 505 red
23 887 white
23 464 white
23 578 white
23 598 white
23 864 red
23 10 white
23 2310 red
23 524 white
23 555 white
23 1742 red
23 1940 red
23 1356 red
23 394 white
24 507 white
24 904 white
24 1341 white
24 1597 red
24 65 white
24 1814 white
24 828 white
24 710 white
24 879 white
24 696 white
24 1828 white
24 1860 white
24 2051 white
24 232 red
24 980 white
24 1197 white
24 1949 white
24 2113 white
24 1302 white
24 627 red
24 64 red
24 1523 red
24 1423 white
24 2349 red
24 1217 red
24 2336 white
24 500 white
24 2125 white
24 213 white
24 1099 white
24 1934 white
24 1169 white
24 2293 white
24 57 white
24 21 red
24 2011 red
24 2142 red
24 1151 white
24 282 white
24 154 white
24 2064 white
24 743 white
24 1343 white
24 685 white
24 581 red
24 297 white
24 24 white
24 1225 white
24 301 red
24 950 white
25 149 red
25 2399 white
25 980 white
25 8 white
25 1728 red
25 511 white
25 1854 white
25 1118 red
25 842 red
25 1528 white
25 969 white
25 1956 white
25 1944 red
25 1968 white
25 2369 red
25 190 red
25 1511 white
25 403 white
25 303 red
25 1053 red
25 793 white
25 2020 red
25 956 red
25 522 red
25 1512 red
25 2308 white
25 1471 red
25 868 red
25 695 white
25 1785 red
25 646 white
25 1343 white
25 2053 red
25 1565 white
25 818 red
25 1040 red
25 759 red
25 1076 white
25 669 white
25 1411 red
25 200 red
25 2286 white
25 563 white
25 999 red
25 2067 red
25 1590 white
25 2022 white
25 1994 white
25 610 white
25 217 white
26 577 white
26 767 white
26 408 red
26 2288 white
26 607 white
26 762 white
26 1444 red
26 2321 red
26 1435 red
26 1162 red
26 59 white
26 2320 white
26 2219 white
26 289 red
26 1099 white
26 1229 red
26 243 red
26 1216 white
26 1781 white
26 2226 red
26 1814 white
26 1982 red
26 226 white
26 327 white
26 1575 red
26 1217 red
26 2191 red
26 390 red
26 1236 red
26 1866 white
26 1019 white
26 393 white
26 398 white
26 789 white
26 1441 white
26 822 red
26 1048 white
26 1787 white
26 688 white
26 1893 white
26 2155 white
26 1601 white
26 120 white
26 1483 white
26 2133 white
26 1431 white
26 91 white
26 1976 white
26 2091 red
26 1628 red
27 2248 red
27 474 red
27 1537 red
27 2088 white
27 840 white
27 946 white
27 2391 white
27 1865 white
27 318 white
27 730 white
27 405 red
27 1219 white
27 1631 red
27 273 red
27 1293 white
27 450 red
27 391 red
27 219 white
27 1953 red
27 1304 red
27 788 white
27 144 white
27 1686 white
27 673 white
27 643 white
27 2046 red
27 1248 white
27 1727 red
27 1330 white
27 732 white
27 562 white
27 749 white
27 1228 red
27 159 red
27 22 white
27 978 white
27 1908 white
27 280 white
27 681 red
27 1856 white
27 1205 red
27 386 white
27 334 white
27 638 white
27 2165 white
27 2348 red
27 1402 white
27 1140 red
27 1244 red
27 435 white
28 833 red
28 1594 red
28 1056 white
28 363 white
28 2051 white
28 575 white
28 607 white
28 476 red
28 2284 red
28 1963 red
28 1829 white
28 1780 red
28 1539 white
28 51 white
28 971 red
28 2131 red
28 159 red
28 2109 white
28 1942 white
28 373 white
28 1893 white
28 1140 red
28 2244 white
28 1644 white
28 290 white
28 1165 white
28 1387 white
28 342 white
28 681 red
28 2212 white
28 948 white
28 951 white
28 2364 white
28 421 white
28 2304 red
28 2251 red
28 320 white
28 531 white
28 252 white
28 439 white
28 1251 white
28 830 red
28 1398 white
28 1827 red
28 876 red
28 2086 white
28 581 red
28 584 white
28 1089 white
28 1424 white
29 1532 red
29 434 white
29 911 white
29 1333 white
29 755 red
29 2057 red
29 169 white
29 1868 red
29 864 red
29 699 red
29 193 white
29 1752 white
29 990 white
29 633 red
29 139 white
29 2215 white
29 609 white
29 1001 white
29 1392 white
29 1177 red
29 1216 white
29 1137 white
29 600 white
29 932 white
29 368 white
29 2125 white
29 1485 white
29 549 white
29 379 white
29 1365 white
29 2362 red
29 2190 red
29 1654 white
29 676 red
29 1879 red
29 648 white
29 1899 white
29 2080 white
29 1735 red
29 1166 white
29 463 red
29 529 white
29 859 white
29 2216 red
29 1608 red
29 1941 white
29 2137 white
29 2213 white
29 208 red
29 1832 white
30 1137 white
30 1481 white
30 1023 white
30 952 red
30 875 white
30 573 white
30 2290 white
30 2181 white
30 1688 red
30 1072 white
30 1984 white
30 1218 white
30 1554 white
30 257 white
30 1352 white
30 2116 red
30 1532 red
30 1417 white
30 2061 white
30 2019 red
30 185 red
30 1650 red
30 1416 red
30 1295 white
30 834 red
30 501 red
30 1920 white
30 714 white
30 2224 red
30 297 white
30 733 white
30 553 white
30 353 white
30 2134 white
30 84 white
30 276 red
30 244 white
30 1872 white
30 1615 white
30 1801 white
30 707 red
30 671 white
30 2184 white
30 1957 red
30 207 red
30 373 white
30 1398 white
30 965 red
30 514 white
30 434 white

Observe that while the first 50 rows of replicate are equal to 1, the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 correspond to the second sample of 50 balls. This pattern continues for all reps = 30 replicates and thus virtual_samples has \(30 \times 50 = 1500\) rows.
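
The stacked layout described above can be reproduced by hand. The following base-R sketch, on a hypothetical toy bowl with made-up counts, draws 3 samples of size 5 and labels each with its replicate number. The names toy_bowl, one_sample, and toy_samples and the sizes are illustrative choices only.

```r
# Hypothetical toy bowl: 8 red and 12 white balls (made-up counts)
toy_bowl <- data.frame(
  ball_ID = 1:20,
  color   = rep(c("red", "white"), times = c(8, 12))
)

set.seed(2)  # make the random draws reproducible
reps <- 3    # number of samples
n    <- 5    # balls per sample

# Draw one sample and tag every row with its replicate number
one_sample <- function(r) {
  picked <- toy_bowl[sample(nrow(toy_bowl), size = n), ]
  data.frame(replicate = r, picked, row.names = NULL)
}

# Stack the samples on top of each other, one block of n rows per replicate
toy_samples <- do.call(rbind, lapply(seq_len(reps), one_sample))

nrow(toy_samples)             # reps * n rows in total
table(toy_samples$replicate)  # n rows for each replicate
```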

virtual_prop_red <- virtual_samples %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

virtual_prop_red
# A tibble: 30 x 3
   replicate   red prop_red
       <int> <int>    <dbl>
 1         1    20     0.4 
 2         2    20     0.4 
 3         3    17     0.34
 4         4    22     0.44
 5         5    17     0.34
 6         6    23     0.46
 7         7    19     0.38
 8         8    16     0.32
 9         9    15     0.3 
10        10    18     0.36
# … with 20 more rows
#kable(virtual_prop_red) # To see all 30 samples

Let’s visualize the distribution of these 30 proportions red based on 30 virtual samples using a histogram with binwidth = 0.05.

ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white", fill = "steelblue") +
  labs(x = "Proportion of 50 balls that were red", 
       title = "Distribution of 30 proportions red") 

Observe that occasionally we obtained proportions red that are less than ____, while on the other hand we occasionally obtained proportions that are greater than ____. However, the most frequently occurring proportions red out of 50 balls were between ____ % and ____ % (for ___ out of 30 samples). Why do we have these differences in proportions red? Because of ___________________.

Exercise 1.1 Redo the above activity with 1000 repeated samples and state your conclusions.

1.2.1 Using different shovels

Suppose we had three shovels to choose from, with 25, 50, and 100 slots. If your goal were still to estimate the proportion of the bowl’s balls that are red, which shovel would you choose? Why? Let’s try to answer these questions.

# Segment 1: sample size = 25 ------------------------------
# 1.a) Virtually use shovel 1000 times
virtual_samples_25 <- bowl %>% 
  rep_sample_n(size = 25, reps = 1000)

# 1.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_25 <- virtual_samples_25 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 25)

# 1.c) Plot distribution via a histogram
p1 <- ggplot(virtual_prop_red_25, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 25 balls that were red", title = "25") 

# Segment 2: sample size = 50 ------------------------------
# 2.a) Virtually use shovel 1000 times
virtual_samples_50 <- bowl %>% 
  rep_sample_n(size = 50, reps = 1000)

# 2.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_50 <- virtual_samples_50 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

# 2.c) Plot distribution via a histogram
p2 <- ggplot(virtual_prop_red_50, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 50 balls that were red", title = "50")  

# Segment 3: sample size = 100 ------------------------------
# 3.a) Virtually using shovel with 100 slots 1000 times
virtual_samples_100 <- bowl %>% 
  rep_sample_n(size = 100, reps = 1000)

# 3.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_100 <- virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 100)

# 3.c) Plot distribution via a histogram
p3 <- ggplot(virtual_prop_red_100, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 100 balls that were red", title = "100") 


# Arrange the three histograms side by side (plot_grid() comes from the cowplot package)
plot_grid(p1, p2, p3, nrow = 1)

Observe that as the sample size increases, the ______ of the 1000 replicates of the proportion red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing the above Figure, things appear to center tightly around roughly ____%.

# n = 25
virtual_prop_red_25 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0944
# n = 50
virtual_prop_red_50 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0681
# n = 100
virtual_prop_red_100 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0478
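
Number of slots in shovel   Standard deviation of proportions red
25                          0.0978
50                          0.0669
100                         0.0471

The values in this table differ slightly from the code output above; each run of the simulation draws new random samples, so the results vary a little from run to run.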

As the sample size increases our numerical measure of spread decreases; there is less variation in our proportions red. In other words, as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more consistent and precise.
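
One way to read the table above: each time the shovel size doubles, the standard deviation shrinks by a factor of roughly \(\sqrt{2}\) rather than halving. The sketch below checks this arithmetic on the three tabulated values. The comparison with \(1/\sqrt{n}\) scaling is a common rule of thumb that is not derived in this chapter, so treat it as an assumption here.

```r
# Standard deviations of the proportions red, copied from the table above
n_slots <- c(25, 50, 100)
sd_prop <- c(0.0978, 0.0669, 0.0471)

# Factor by which the spread shrinks each time the shovel size doubles
shrink_factor <- sd_prop[-length(sd_prop)] / sd_prop[-1]
round(shrink_factor, 2)

# Reference value under the assumed 1/sqrt(n) rule of thumb
sqrt(2)
```

In practical terms, under this rule of thumb, cutting the spread in half would take a shovel about four times as large, not twice as large.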

1.3 What did we learn?

This was our first attempt at understanding two key concepts relating to sampling for estimation:

  1. The effect of sampling variation on our estimates.
  2. The effect of sample size on sampling variation.

Let’s now introduce some terminology and notation as well as statistical definitions related to sampling.

1.4 Terminology & notation

  1. (Study) Population: A (study) population is a collection of individuals or observations about which we are interested. We mathematically denote the population’s size using upper case \(N\). In our simulations the (study) population was the collection of \(N = 2400\) identically sized red and white balls contained in the bowl.

  2. Population parameter: A population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the population mean, which is mathematically denoted with the Greek letter \(\mu\) (pronounced “mu”). In our simulations, however, since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the population proportion, which is mathematically denoted with the letter \(p\).

  3. Census: An exhaustive enumeration or counting of all \(N\) individuals or observations in the population in order to compute the population parameter’s value exactly. In our simulations, this would correspond to manually going over all \(N = 2400\) balls in the bowl and counting the number that are red and computing the population proportion \(p\) of the balls that are red exactly. When the number \(N\) of individuals or observations in our population is large, as was the case with our bowl, a census can be very expensive in terms of time, energy, and money.

  4. Sampling: Sampling is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case \(n\), as opposed to upper case \(N\) which denotes the population’s size. Typically the sample size \(n\) is much smaller than the population size \(N\), thereby making sampling a much cheaper procedure than a census. In our simulations, we used shovels with 25, 50, and 100 slots to extract a sample of size \(n = 25\), \(n = 50\), and \(n = 100\) balls.

  5. Point estimate (AKA sample statistic): A summary statistic computed from the sample that estimates the unknown population parameter. In our simulations, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with \(p\). Our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using \(\hat{p}\); the “hat” on top of the \(p\) indicates that it is an estimate of the unknown population proportion \(p\).

  6. Representative sampling: A sample is said to be a representative sample if it is representative of the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our simulations, are the samples of \(n\) balls extracted using our shovels representative of the bowl’s \(N = 2400\) balls?

  7. Generalizability: We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, can the value of the point estimate be generalized to estimate the value of the population parameter well? In our simulations, can we generalize the values of the sample proportions red of our shovels to the population proportion red of the bowl? Using mathematical notation, is \(\hat{p}\) a “good guess” of \(p\)?

  8. Bias: In a statistical sense, we say bias occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population has an equal chance of being sampled. In our simulations, since each ball had the same size and hence an equal chance of being sampled in our shovels, our samples were unbiased.

  9. Random sampling: We say a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our simulations, this would correspond to sufficiently mixing the bowl before each use of the shovel.

Let’s put them all together:

  • If we extract a sample of \(n=50\) balls at random, in other words we mix the equally-sized balls before using the shovel, then

  • the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus

  • any result based on the sample of balls can generalize to the bowl, thus

  • the sample proportion \(\hat{p}\) of the \(n=50\) balls in the shovel that are red is a “good guess” of the population proportion \(p\) of the \(N =2400\) balls that are red, thus

  • instead of manually going over all the balls in the bowl, we can infer about the bowl using the shovel.

Definition 1.1 The sampling distribution of a statistic (e.g. mean, median, proportion) is the probability distribution of the values that statistic takes over all possible samples of a given size.

Definition 1.2 The standard deviation of a sampling distribution is called the standard error.
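
To make both definitions concrete, here is a minimal sketch in base R that approximates the sampling distribution of \(\hat{p}\) by simulation and then takes its standard deviation. The vector bowl_colors is a hypothetical stand-in for the bowl (2400 balls, with 900 red assumed for illustration), prop_red is a helper written only for this sketch, and the choice of 1000 repetitions is arbitrary. The simulated values will differ slightly from any table, since they depend on the random seed.

```r
set.seed(76)  # arbitrary seed, only so the run is reproducible

# Hypothetical stand-in for the bowl: 2400 balls, 900 of them red (assumed).
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))

# One virtual shovel: draw n balls without replacement, return proportion red.
prop_red <- function(n) mean(sample(bowl_colors, size = n) == "red")

# Repeat the shovel 1000 times per sample size. The 1000 values of p-hat
# approximate the sampling distribution; their sd estimates the standard error.
shovel_sizes <- c(25, 50, 100)
standard_errors <- sapply(
  shovel_sizes,
  function(n) sd(replicate(1000, prop_red(n)))
)

data.frame(n = shovel_sizes, standard_error = round(standard_errors, 4))
```

The standard errors should shrink as the shovel gets larger, which is the pattern the definition of standard error is meant to capture.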

Example: This is the same table as above, but notice the 2nd column name.

Number of slots in shovel    Standard error of proportions red
25                           0.0978
50                           0.0669
100                          0.0471

Exercise 1.2 Find and plot the sampling distribution of the proportion (\(\hat{p}\)) of heads when you flip a fair coin. Use 5000 sets of 10 tosses each.
  1. What is the sample size?

  2. How many experiments?

  3. Find and plot the sampling distribution of \(\hat{p}\).

  4. Find the standard error of the sampling distribution of \(\hat{p}\).

  5. What happens to the standard error when you increase the sample size?

1.5 DETOUR — Some brand name distributions

1.5.1 Normal Distribution

The normal distribution is defined by the following probability density function, where \(\mu\) is the population mean and \(\sigma\) is the standard deviation.

\[f(x) = \dfrac{1}{\sigma \sqrt{2 \pi}}e^{-(x-\mu)^2/{2\sigma^2}}\]

If a random variable \(X\) follows the normal distribution, then we write: \(X \sim N(\mu, \sigma^2)\)

Here is what the normal density looks like. Example: \(X \sim N(0, 1)\)

# Ignore this code

p1 <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
  scale_y_continuous(breaks = NULL)
p1

R functions (package: stats): dnorm (density), pnorm (left-tail probability), qnorm (quantile), rnorm (random sample)

Examples:

  1. Generate a random sample of 100 from \(N(15, 9)\) and create a histogram.
x <- rnorm(n = 100, mean = 15, sd = 3)

ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 1.5) 

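  2. If \(X \sim N(15, 9)\), find the probability that \(X\) is greater than 21: \(P(X > 21)\)
pnorm(21, mean = 15, sd = 3) # pnorm gives the left-tail area P(X <= 21), which is not the answer yet
[1] 0.9772499
1 - pnorm(21, mean = 15, sd = 3) # P(X > 21) is the complement of the left-tail area
[1] 0.02275013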
  3. If \(X \sim N(15, 9)\), find the 25th percentile (the 0.25 quantile).
qnorm(.25, mean = 15, sd = 3)
[1] 12.97653
#pnorm(12.97653, mean = 15, sd = 3)
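
The following sketch ties the examples above together. It assumes only the \(N(15, 9)\) setup already used, and the variable names are illustrative. It shows that pnorm reports left-tail areas, so a right-tail probability needs either the complement or lower.tail = FALSE, that standardizing to \(Z = (X - \mu)/\sigma\) gives the same answer, and that qnorm undoes pnorm.

```r
mu <- 15
sigma <- 3  # N(15, 9) has variance 9, so the standard deviation is 3

# Right-tail probability P(X > 21), computed two equivalent ways.
z <- (21 - mu) / sigma                # standardized value of 21
p_right_std <- 1 - pnorm(z)           # complement of the left tail under N(0, 1)
p_right_raw <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)

# qnorm is the inverse of pnorm: feeding the 25th percentile back into
# pnorm should return 0.25.
q25 <- qnorm(0.25, mean = mu, sd = sigma)
check <- pnorm(q25, mean = mu, sd = sigma)

c(z = z, p_right_std = p_right_std, p_right_raw = p_right_raw, check = check)
```

Using lower.tail = FALSE is usually preferable to subtracting from 1 when the tail area is very small, because it avoids a loss of floating-point precision.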

Question:

  1. A radar unit is used to measure speeds of cars on a motorway. The speeds are normally distributed with a mean of 90 km/hr and a standard deviation of 10 km/hr. What is the probability that a car picked at random is travelling at more than 100 km/hr?

  2. GMAT scores are roughly normally distributed with a mean of 527 and a standard deviation of 112. How high must an individual score on the GMAT in order to place in the highest 5%?

1.5.2 Exponential Distribution

The exponential distribution is defined by the following probability density function, where \(\lambda > 0\) is the rate and \(\dfrac{1}{\lambda}\) is both the population mean and the population standard deviation.

\[f(x) = \lambda e^{-\lambda x}, \quad x \ge 0\]

If a random variable \(X\) follows the Exponential Distribution, then we write: \(X \sim Exp(\lambda)\)

Here is what the exponential density looks like. Example: \(X \sim Exp(1/15)\)

# Ignore this code
x <- seq(0, 100, length.out=1000)
dat <- data.frame(x=x, px=dexp(x, rate=1/15))

ggplot(dat, aes(x=x, y=px)) + geom_line()

R functions (package: stats): dexp (density), pexp (left-tail probability), qexp (quantile), rexp (random sample)

Examples:

  1. Generate a random sample of 100 from \(Exp(1/15)\) and create a histogram.
x <- rexp(n = 100, rate = 1/15)

ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 5) 

  2. If \(X \sim Exp(1/15)\), find the probability that \(X\) is less than 21: \(P(X < 21)\)
pexp(21, rate = 1/15)
[1] 0.753403
  3. If \(X \sim Exp(1/15)\), find the 75th percentile (the 0.75 quantile).
qexp(.75, rate = 1/15)
[1] 20.79442
#pexp(20.79442, rate = 1/15)
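
Because the exponential distribution has a closed-form cumulative distribution function, \(F(x) = 1 - e^{-\lambda x}\), the two results above can also be reproduced by hand. The sketch below is only a cross-check under that formula; exp_cdf and exp_quantile are hypothetical helpers written for this illustration, not functions from the stats package.

```r
rate <- 1 / 15  # lambda; the mean is 1 / lambda = 15

# Closed-form CDF and its inverse (the quantile function).
exp_cdf <- function(x, lambda) 1 - exp(-lambda * x)
exp_quantile <- function(p, lambda) -log(1 - p) / lambda

cdf_by_hand <- exp_cdf(21, rate)          # P(X < 21)
q75_by_hand <- exp_quantile(0.75, rate)   # 75th percentile

# Both comparisons should be TRUE up to floating-point tolerance.
all.equal(cdf_by_hand, pexp(21, rate = rate))
all.equal(q75_by_hand, qexp(0.75, rate = rate))
```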

Question:

The number of days ahead travelers purchase their airline tickets can be modeled by an exponential distribution with the average amount of time equal to 15 days.

  1. Find the probability that a traveler will purchase a ticket fewer than ten days in advance.

  2. Within how many days in advance do 80% of all travelers purchase their tickets (the 80th percentile)?

1.5.3 Binomial Distribution

The binomial distribution is a discrete probability distribution. It describes the number of successes in \(n\) independent trials of an experiment. Each trial is assumed to have only two outcomes, either success or failure, and the same probability of success. If the probability of a successful trial is \(p\), then the probability of having \(x\) successful outcomes in an experiment of \(n\) independent trials is as follows.

\[f(x) = {n \choose x} p^x (1-p)^{(n-x)} \quad \text{where x = 0, 1, 2,...,n}\]
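
As a quick check of this formula, the sketch below codes the probability mass function directly and compares it with R's built-in dbinom. The helper binom_pmf is hypothetical, and the values \(n = 10\) and \(p = 0.5\) (ten tosses of a fair coin) are assumed purely for illustration.

```r
# Binomial pmf written directly from the formula.
# binom_pmf is a hypothetical helper; dbinom() is the built-in equivalent.
binom_pmf <- function(x, n, p) choose(n, x) * p^x * (1 - p)^(n - x)

n <- 10    # assumed: ten tosses
p <- 0.5   # assumed: fair coin
x <- 0:n   # every possible number of heads

by_hand <- binom_pmf(x, n, p)

all.equal(by_hand, dbinom(x, size = n, prob = p))  # formula agrees with dbinom
sum(by_hand)                                       # the probabilities sum to 1
by_hand[x == 5]                                    # P(exactly 5 heads)
```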

Example: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct.

  1. Find the probability of having exactly four correct answers if a student attempts to answer every question at random.

  2. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Solution:

Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2.

  1. By hand:

\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} = 0.1329\)

In R:

dbinom(4, size=12, prob=0.2) 
[1] 0.1328756
  2. By hand:

\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} + {12 \choose 3} 0.2^3 (1-0.2)^{(12-3)} + {12 \choose 2} 0.2^2 (1-0.2)^{(12-2)} + {12 \choose 1} 0.2^1 (1-0.2)^{(12-1)} + {12 \choose 0} 0.2^0 (1-0.2)^{(12-0)} = 0.9274\)

In R:

OR Alternatively,